1. Audit of current scripts and refactor requirements

1.1 What exists today (must not lose)

From rul_estimator.py (simple AR1):

From enhanced_rul_estimator.py:

From rul_common.py:

1.2 Gaps / issues that must be addressed in refactor

Structural:

Schema / contracts:

Analytics / robustness:

1.3 Your explicit constraints


2. Detailed refactor plan for Copilot (task backlog)

Single table; each row is a concrete change Copilot should implement.

| Task ID | Priority | Category | Scope (files / modules) | Description (What to do) | Key Implementation Details / Constraints | Dependencies |
|---|---|---|---|---|---|---|
| RUL-REF-01 | H | Module layout | New rul_engine.py (or rename current rul_estimator.py) | Create a single RUL module that will replace both rul_estimator.py and enhanced_rul_estimator.py. | Start by copying the richer parts from enhanced_rul_estimator.py into a new module. Do not keep two public estimate_rul_and_failure functions. There will be a single public API run_rul(...) (see later tasks). | None |
| RUL-REF-02 | H | Helpers / Config | rul_common.py + new rul_engine.py | Move norm_cdf and RULConfig from rul_common.py into the new RUL module as the canonical definitions. Remove redundant imports of core.rul_common from existing files. | Keep RULConfig fields from rul_common.py and extend if needed later; but enforce that this one definition is referenced everywhere (no duplicate dataclasses). Optionally leave rul_common.py as a thin re-export pointing at the new module for backward compatibility. | RUL-REF-01 |
| RUL-REF-03 | H | Config loading | New rul_engine.py | Implement load_rul_config(sql_client, equip_id, config_row=None) -> RULConfig. | Logic: 1) start from default RULConfig(). 2) Overlay with config_row if provided (fields like health_threshold, max_forecast_hours, RUL bands, etc.). 3) Optionally overlay per-asset overrides from a dedicated RUL config table (e.g. ACM_RUL_Config) if it exists. No CSV. Keep this function small and pure (no logging beyond a single “[RUL] Loaded RULConfig for EquipID=…” line). | RUL-REF-02 |
| RUL-REF-04 | H | IO – health | Replace _load_health_timeline in both estimators with a unified function in new module | Create a single load_health_timeline(sql_client, equip_id, run_id, output_manager, cfg) -> (df, data_quality_flags) in the new module. | Behaviour: Priority 1: try output_manager.get_cached_table("ACM_HealthTimeline") (or whatever key is used today). Priority 2: read from SQL via sql_client.cur using the current ACM_HealthTimeline query; remove CSV fallback (health_timeline.csv). Normalise timestamp column names; enforce timezone-naive policy. Add basic data quality flags (SPARSE, GAPPY, FLAT, OK) based on point count, variance, and time gaps. Replace both old _load_health_timeline implementations with calls to this function. | RUL-REF-01 |
| RUL-REF-05 | H | IO – sensor hotspots | Replace sensor hotspot loading in both estimators | Implement load_sensor_hotspots(sql_client, equip_id, run_id) -> pd.DataFrame in the new module, SQL-only. | Read from the existing sensor hotspots table or view (same query as today but no CSV fallback). Ensure final columns match what attribution logic expects: RunID, EquipID, FailureTime, SensorName, FailureContribution, ZScoreAtFailure (or Z), AlertCount (align this with AboveAlertCount mapping). If SQL returns a different column name (e.g. AboveAlertCount), rename once inside this function and use AlertCount consistently downstream. | RUL-REF-01 |
| RUL-REF-06 | H | IO – learning state | enhanced_rul_estimator.py JSON → SQL | Replace JSON-file-based LearningState.save/load with SQL-backed persistence. | Design a small table ACM_RUL_LearningState (or reuse an existing one): columns like EquipID, ModelName, MAE, RMSE, Bias, RecentErrorsJson, CalibrationFactor, LastUpdated. Implement load_learning_state(sql_client, equip_id) -> LearningState and save_learning_state(sql_client, equip_id, state) -> None. Remove use of tables_dir / rul_learning_state_*.json and the corresponding JSON operations. Keep truncation of error history and prediction history as in current LearningState, but store in SQL as JSON text or similar. | RUL-REF-02 |
| RUL-REF-07 | M | IO – RUL history | New rul_engine.py | Implement load_rul_history(sql_client, equip_id) -> pd.DataFrame used to calibrate RUL vs actual failures. | Read from existing RUL history table if available (e.g. ACM_RUL_History or equivalent). At minimum, need: RunID, EquipID, PredictedRUL, ActualFailureTime, PredictionTime. If this table does not yet exist, design schema and create later. The function should return empty DataFrame if no records; engine must handle that gracefully. | RUL-REF-06 |
| RUL-REF-08 | M | IO – anomaly energy | New rul_engine.py | Implement load_anomaly_energy(sql_client, equip_id, run_id) -> Optional[pd.DataFrame]. | For now, treat this as optional. Try to read a standard table/view (e.g. ACM_AnomalyEnergy_TS) that gives Timestamp, CumulativeEnergy or similar. If not present, return None. This enables the energy-based RUL path in compute_rul_multipath without hard-coding CSV paths. | RUL-REF-01 |
| RUL-REF-09 | H | IO – outputs | New rul_engine.py | Implement SQL writers: write_rul_summary, write_rul_forecasts, write_rul_attribution, write_maintenance_reco. | Each writer takes a DataFrame plus sql_client and performs INSERT into the correct tables (ACM_RUL_Summary, ACM_HealthForecast_TS, ACM_FailureForecast_TS, ACM_RUL_TS, ACM_RUL_Attribution, ACM_MaintenanceRecommendation). Preserve existing schemas and cleanup semantics (RunID, EquipID, CreatedAt). No CSV/file output. | RUL-REF-01 |
| RUL-REF-10 | H | Forecast cleanup | New rul_engine.py | Move forecast cleanup logic (using ACM_FORECAST_RUNS_RETAIN) into a single helper cleanup_old_forecasts(sql_client, equip_id). | Use the more complete version from the enhanced estimator, which loops over ["ACM_HealthForecast_TS", "ACM_FailureForecast_TS"] and deletes rows for older RunIDs, keeping N most recent based on MAX(CreatedAt). Ensure the WITH CTE / ROW_NUMBER() logic is kept intact. Call this helper at the top of run_rul before inserting new forecast TS rows. | RUL-REF-09 |
| RUL-REF-11 | H | Model layer – base | New rul_engine.py | Move DegradationModel base class from enhanced estimator into the new module. | Keep fit / predict signatures but remove any file or JSON references. It should be purely numerical. Ensure subclasses (AR1Model, ExponentialDegradationModel, WeibullInspiredModel) subclass this base. | RUL-REF-01 |
| RUL-REF-12 | H | Model layer – AR1 | New rul_engine.py | Move AR1Model from enhanced estimator and make it the only AR1 implementation used by the engine. | Remove any old AR(1) logic embedded in the simple estimator; all AR behaviour should pass through AR1Model.fit/predict. Ensure it handles recent-window training (configurable via RULConfig – e.g. last N points or last N hours). | RUL-REF-11 |
| RUL-REF-13 | M | Model layer – Exponential & Weibull | New rul_engine.py | Move ExponentialDegradationModel and WeibullInspiredModel from enhanced estimator into the new module. | Keep their current parameter fitting logic and error handling, but remove any CSV/file dependencies. Make them robust to short or noisy series (return a “fit_failed” flag that the ensemble can down-weight). | RUL-REF-11 |
| RUL-REF-14 | H | Ensemble orchestration | New rul_engine.py | Implement a single RULModel class that wraps AR1, Exponential, and Weibull models and always runs an ensemble. | RULModel should: 1) Accept RULConfig and LearningState. 2) fit(t, h) fits all models; models that fail are marked. 3) forecast(t_future) returns combined mean/std plus per-model mean/std and per-model weights derived from LearningState and config (min_model_weight, etc.). Ensemble weights must be normalised and clamped so no model has weight below min_model_weight after normalisation. | RUL-REF-12, RUL-REF-13, RUL-REF-06 |
| RUL-REF-15 | M | Failure distribution | New rul_engine.py | Implement a helper on RULModel (or standalone) to compute failure probability curve over time from the ensemble health forecast. | Use threshold crossing logic: for each future time, approximate the probability that health has dropped below cfg.health_threshold, using the predicted mean and std (Gaussian assumption) and norm_cdf. Optionally extend with Monte Carlo sampling if needed later. Output: Timestamp, FailureProb, and optionally FailurePDF. This replaces ad-hoc hazard logic, but must still conform to what compute_rul_multipath expects (a hazard_df with FailureProb and Timestamp). | RUL-REF-14 |
| RUL-REF-16 | H | LearningState consolidation | New rul_engine.py | Move ModelPerformanceMetrics and LearningState from enhanced estimator and adapt them to the SQL persistence. | Ensure LearningState no longer references Path or JSON. It should be a pure in-memory object. Keep fields: per-model MAE/RMSE/bias/error lists, global calibration_factor, last_updated, and prediction_history. Implement a separate adapter that maps between LearningState and the SQL ACM_RUL_LearningState table. | RUL-REF-06 |
| RUL-REF-17 | H | Learning update | New rul_engine.py | Implement update_learning_state(learning_state, rul_history_df, cfg) which updates metrics and calibration based on past runs. | Use rul_history_df (predicted RUL vs actual failure) to update MAE/RMSE/bias for each model and overall calibration factor. Trim all histories to cfg.calibration_window. Clamp calibration factor to reasonable bounds (e.g. [0.5, 2.0]) to avoid extreme uncertainty rescaling. | RUL-REF-16, RUL-REF-07 |
| RUL-REF-18 | H | Single engine core | New rul_engine.py | Implement one function compute_rul(health_df, cfg, learning_state, anomaly_energy_df, data_quality_flags) -> dict. This is the only engine. | Steps: 1) Data conditioning (sort, deduplicate timestamps, handle gaps, optional resample to median step, set data_quality_flags). 2) Detect sampling interval and build future index up to cfg.max_forecast_hours. 3) Instantiate RULModel with current learning_state and cfg. 4) Fit on conditioned health history (use recent window policy from config). 5) Forecast health mean/std over horizon. 6) Compute failure probability curve via RUL-REF-15. 7) Call compute_rul_multipath (see next task) to derive trend/hazard/energy RULs and select final RUL. 8) Build output dict with health_forecast_df, failure_curve_df, rul_ts_df, and internal diagnostics. There should be no “simple vs enhanced” branch here. | RUL-REF-04, RUL-REF-08, RUL-REF-14, RUL-REF-15, RUL-REF-17 |
| RUL-REF-19 | H | Multipath RUL | Adapt compute_rul_multipath from enhanced estimator | Move and adapt compute_rul_multipath into the new module so it works with the ensemble outputs and SQL-only context. | Standardise the expected columns of health_forecast_df: Timestamp, HealthIndex, CI_Lower, CI_Upper, ForecastStd. Build hazard_df from the failure curve (Timestamp, FailureProb). If anomaly_energy_df is present, use its Timestamp & CumulativeEnergy. Compute: 1) trajectory-based RUL (when mean crosses threshold), 2) hazard-based RUL (when FailureProb ≥ threshold), 3) energy-based RUL. Select RUL_Selected and SelectedPath based on rules (e.g. choose earliest RUL among consistent paths). | RUL-REF-18, RUL-REF-15 |
| RUL-REF-20 | M | Confidence computation | New rul_engine.py | Implement a single confidence calculation combining CI width, model agreement, and calibration stability. | Start from existing logic in enhanced estimator: narrower CI and high agreement → higher confidence. Add caps/clamping so confidence stays in [0,1], and smooth mapping (e.g. monotonic function of normalised CI width). Use calibration_factor stability (how much it has changed recently) to penalise confidence when the model is still “learning”. Store confidence in both time-series (optional) and summary. | RUL-REF-18, RUL-REF-17 |
| RUL-REF-21 | M | RUL distribution & bands | New rul_engine.py | From failure curve, derive RUL distribution and maintenance bands. | Convert failure CDF into RUL (time to failure) distribution; compute quantiles (P10, P50, P90). Map RUL and confidence into bands like Normal, Watch, Plan, Urgent based on thresholds from RULConfig (e.g. band edges in hours). Return these fields in the summary and feed into maintenance recommendation builder. | RUL-REF-18, RUL-REF-20 |
| RUL-REF-22 | H | Attribution builder | New rul_engine.py | Implement a unified build_sensor_attribution(sensor_hotspots_df, rul_result, cfg) -> pd.DataFrame. | Base this on the more complete implementation in the enhanced estimator. Use SQL-loaded sensor hotspots (RUL-REF-05) only. Ensure columns: RunID, EquipID, FailureTime, SensorName, FailureContribution, ZScoreAtFailure, AlertCount, Comment (if used). Fix column name mismatch (AboveAlertCount vs AlertCount) here once. | RUL-REF-05, RUL-REF-18 |
| RUL-REF-23 | M | Maintenance recommendations | New rul_engine.py | Implement build_maintenance_recommendation(rul_result, bands, data_quality_flags, cfg) -> pd.DataFrame. | Use RUL_Selected, quantiles, confidence, data quality flags, and band classification (RUL-REF-21) to build structured recommendations (e.g. “Monitor”, “Plan at next shutdown”, “Immediate inspection”). Output to ACM_MaintenanceRecommendation schema. The engine should be the single source of these recommendations (no second implementation). | RUL-REF-21 |
| RUL-REF-24 | H | Output shaping | New rul_engine.py | Implement helpers: make_summary_row, make_health_forecast_df, make_failure_curve_df, make_rul_ts_df. | Ensure all DataFrames match the existing SQL schemas and column names used by current dashboards (including RunID, EquipID, Timestamp, HealthIndex, CI_Lower/CI_Upper, etc.). Also fix the CI naming inconsistency: pick one convention (e.g. CI_Lower, CI_Upper) and update both creation and consumers (including compute_rul_multipath) to use it consistently. | RUL-REF-18, RUL-REF-19 |
| RUL-REF-25 | H | Public API | New rul_engine.py and acm_main | Implement the single public function run_rul(sql_client, equip_id, run_id, output_manager=None, config_row=None) -> dict. | Inside run_rul: 1) Load cfg via load_rul_config. 2) Optionally clean up old forecasts (RUL-REF-10). 3) Load health TS (RUL-REF-04). 4) Load learning state (RUL-REF-06). 5) Load anomaly energy (RUL-REF-08). 6) Call compute_rul. 7) Build sensor attribution and maintenance reco. 8) Write all outputs to SQL via RUL-REF-09. 9) Save updated learning state to SQL. 10) Return dict of DataFrames for OutputManager/artifact handling. Update acm_main to call this function only, and remove calls to the old estimators. | RUL-REF-04, RUL-REF-06, RUL-REF-18, RUL-REF-22, RUL-REF-23, RUL-REF-24 |
| RUL-REF-26 | M | Type normalisation | New rul_engine.py | Centralise RunID/EquipID handling. | At the start of run_rul, normalise run_id to str and equip_id to int. Ensure all output DataFrames use these normalised types. Remove duplicated RunID/EquipID insertion logic scattered across old code. | RUL-REF-25 |
| RUL-REF-27 | M | Logging | New rul_engine.py | Standardise logging across all RUL paths. | Use consistent prefixes like [RUL], [RUL-Model], [RUL-Learn], [RUL-Multipath]. Every early-exit path should log equip_id, run_id, and reason (e.g. “too few points”, “no health timeline”). Keep logs free of file path references (since CSV/JSON is removed). | RUL-REF-25 |
| RUL-REF-28 | M | Regime awareness (optional) | New rul_engine.py | Add optional regime handling to the engine. | Extend run_rul and/or compute_rul to optionally accept a regime_id or regime series if available (from health TS or separate table). Use this to restrict training window to current regime or to weight recent regime more heavily. Keep this behind config flags so engine is still single and deterministic when regimes are not provided. | RUL-REF-18 |
| RUL-REF-29 | H | Remove CSV and legacy engines | Old rul_estimator.py, enhanced_rul_estimator.py, rul_common.py | Remove all CSV references and redundant engine functions. | After the new engine is wired: 1) Delete or deprecate CSV fallbacks (health_timeline.csv, sensor_hotspots.csv etc.) from old modules. 2) Remove JSON LearningState save/load paths. 3) Either delete old estimate_rul_and_failure implementations or keep thin wrappers that immediately call run_rul for backward compatibility, without CSV or second engine logic. 4) Ensure project references to core.rul_common are updated if that file is reduced to re-exports only. | RUL-REF-25 |
| RUL-REF-30 | M | Sanity checks / harness | New test harness script / notebook | Add a small harness that exercises run_rul end-to-end. | The harness should: 1) Run RUL for a few historical runs per asset. 2) Check that SQL tables receive rows with correct schemas and non-null core fields. 3) Log RUL summary (RUL_Selected, bands, confidence, data quality) for manual review. No CSV involved. | RUL-REF-25 |
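
Illustration for RUL-REF-14 (ensemble weights). The constraint that no model ends up below min_model_weight after normalisation is easy to get wrong if weights are clamped first and normalised afterwards, because the final normalisation can push a clamped weight back under the floor. The sketch below is one possible approach, not the existing implementation: it pins low weights at the floor and redistributes the remaining mass among the other models. The function name ensemble_weights and the raw_scores input (for example inverse MAE per model taken from LearningState) are assumptions made for this example.

```python
from typing import Dict


def ensemble_weights(raw_scores: Dict[str, float], min_weight: float) -> Dict[str, float]:
    """Turn non-negative raw scores into weights that sum to 1 with a per-model floor.

    raw_scores is a hypothetical input (e.g. inverse MAE per model).
    Models whose proportional share falls below min_weight are pinned at the
    floor; the remaining mass is shared proportionally among the others.
    """
    names = list(raw_scores)
    n = len(names)
    if n == 0:
        raise ValueError("at least one model is required")
    if min_weight < 0 or min_weight * n > 1.0:
        raise ValueError("min_weight must satisfy 0 <= min_weight <= 1/n")

    floored = set()   # models pinned at min_weight
    weights: Dict[str, float] = {}
    while True:
        free = [k for k in names if k not in floored]
        free_mass = 1.0 - min_weight * len(floored)
        total = sum(max(raw_scores[k], 0.0) for k in free)
        for k in free:
            if total > 0:
                weights[k] = free_mass * max(raw_scores[k], 0.0) / total
            else:
                # No usable scores: share the remaining mass equally.
                weights[k] = free_mass / len(free)
        newly_floored = [k for k in free if weights[k] < min_weight]
        if not newly_floored:
            return weights
        for k in newly_floored:
            weights[k] = min_weight
            floored.add(k)


if __name__ == "__main__":
    w = ensemble_weights({"ar1": 10.0, "exp": 1.0, "weibull": 0.1}, min_weight=0.1)
    print(w)
```

Because the floor is enforced on the final weights rather than on an intermediate step, no second normalisation pass is needed. A model flagged as fit_failed could be passed in with a raw score of 0.0, in which case it receives exactly min_weight.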
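
Illustration for RUL-REF-15 (failure probability curve). Under the Gaussian assumption, the probability that health is below the threshold at a given future time is Φ((threshold − mean) / std). The sketch below shows that calculation only. The local norm_cdf is a stand-in for the helper in rul_common.py, whose exact signature is not shown in this plan, and the function name failure_probability_curve and the min_std guard are assumptions for this example. It returns a list of dicts that could be passed to pd.DataFrame(...).

```python
import math
from typing import Dict, List, Sequence


def norm_cdf(x: float) -> float:
    """Standard normal CDF (stand-in for the norm_cdf helper in rul_common.py)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def failure_probability_curve(
    timestamps: Sequence,
    mean: Sequence[float],
    std: Sequence[float],
    health_threshold: float,
    min_std: float = 1e-6,  # assumed guard against zero forecast std
) -> List[Dict]:
    """Pointwise P(health < threshold) for each forecast step."""
    rows = []
    for ts, mu, sigma in zip(timestamps, mean, std):
        sigma = max(sigma, min_std)
        prob = norm_cdf((health_threshold - mu) / sigma)
        rows.append({"Timestamp": ts, "FailureProb": prob})
    return rows


if __name__ == "__main__":
    curve = failure_probability_curve(
        timestamps=[1, 2, 3],
        mean=[80.0, 75.0, 70.0],
        std=[5.0, 5.0, 5.0],
        health_threshold=70.0,
    )
    for row in curve:
        print(row)
```

This is a pointwise probability per timestamp, so it is not guaranteed to be monotone if the forecast mean recovers. If downstream code treats it as a cumulative failure distribution (as RUL-REF-21 does), one option is to take a running maximum first.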
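
Illustration for RUL-REF-21 (RUL quantiles and bands). If the failure curve is treated as a cumulative distribution over forecast hours, the q-th RUL quantile is the first time at which the curve reaches q. The sketch below uses linear interpolation between forecast steps and assumes the curve is non-decreasing. The function names and the band edges (24 h, 168 h, 720 h) are hypothetical placeholders; the real edges would come from RULConfig. Confidence is left out of the band mapping here to keep the example short.

```python
from typing import Optional, Sequence, Tuple


def rul_quantile(
    hours: Sequence[float], failure_prob: Sequence[float], q: float
) -> Optional[float]:
    """First horizon (in hours) at which the cumulative failure probability reaches q.

    Assumes hours is ascending and failure_prob is non-decreasing.
    Returns None when the curve never reaches q within the forecast horizon.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if len(hours) == 0 or len(hours) != len(failure_prob):
        raise ValueError("hours and failure_prob must be non-empty and equal length")
    if failure_prob[0] >= q:
        return float(hours[0])
    for i in range(1, len(hours)):
        p0, p1 = failure_prob[i - 1], failure_prob[i]
        if p1 >= q:
            # p0 < q <= p1 here, so the denominator is strictly positive.
            frac = (q - p0) / (p1 - p0)
            return hours[i - 1] + frac * (hours[i] - hours[i - 1])
    return None


# Hypothetical band edges in hours; real values would come from RULConfig.
DEFAULT_EDGES: Tuple[Tuple[str, float], ...] = (
    ("Urgent", 24.0),
    ("Plan", 168.0),
    ("Watch", 720.0),
)


def classify_band(rul_hours: Optional[float], edges=DEFAULT_EDGES) -> str:
    """Map an RUL in hours to a band; no crossing within the horizon maps to Normal."""
    if rul_hours is None:
        return "Normal"
    for name, upper in edges:
        if rul_hours < upper:
            return name
    return "Normal"


if __name__ == "__main__":
    hours = [0.0, 10.0, 20.0, 30.0, 40.0]
    probs = [0.0, 0.05, 0.25, 0.60, 0.95]
    for q in (0.10, 0.50, 0.90):
        r = rul_quantile(hours, probs, q)
        print(q, r, classify_band(r))
```

Note that P10 is the earliest of the three quantiles: it is the horizon by which there is already a 10% chance of failure, so it is the pessimistic end of the RUL range.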